Succinct Data Structure
   HOME

TheInfoList



OR:

In
computer science Computer science is the study of computation, automation, and information. Computer science spans theoretical disciplines (such as algorithms, theory of computation, information theory, and automation) to Applied science, practical discipli ...
, a succinct data structure is a
data structure In computer science, a data structure is a data organization, management, and storage format that is usually chosen for efficient access to data. More precisely, a data structure is a collection of data values, the relationships among them, a ...
which uses an amount of space that is "close" to the
information-theoretic Information theory is the scientific study of the quantification, storage, and communication of information. The field was originally established by the works of Harry Nyquist and Ralph Hartley, in the 1920s, and Claude Shannon in the 1940s. T ...
lower bound, but (unlike other compressed representations) still allows for efficient query operations. The concept was originally introduced by Jacobson to encode
bit vector A bit array (also known as bitmask, bit map, bit set, bit string, or bit vector) is an array data structure that compactly stores bits. It can be used to implement a simple set data structure. A bit array is effective at exploiting bit-level pa ...
s, (unlabeled)
trees In botany, a tree is a perennial plant with an elongated stem, or trunk, usually supporting branches and leaves. In some usages, the definition of a tree may be narrower, including only woody plants with secondary growth, plants that are u ...
, and
planar graph In graph theory, a planar graph is a graph that can be embedded in the plane, i.e., it can be drawn on the plane in such a way that its edges intersect only at their endpoints. In other words, it can be drawn in such a way that no edges cross ...
s. Unlike general
lossless data compression Lossless compression is a class of data compression that allows the original data to be perfectly reconstructed from the compressed data with no loss of information. Lossless compression is possible because most real-world data exhibits statistic ...
algorithms, succinct data structures retain the ability to use them in-place, without decompressing them first. A related notion is that of a
compressed data structure The term compressed data structure arises in the computer science subfields of algorithms, data structures, and theoretical computer science. It refers to a data structure whose operations are roughly as fast as those of a conventional data structu ...
, in which the size of the data structure depends upon the particular data being represented. Suppose that Z is the information-theoretical optimal number of bits needed to store some data. A representation of this data is called: * ''
implicit Implicit may refer to: Mathematics * Implicit function * Implicit function theorem * Implicit curve * Implicit surface * Implicit differential equation Other uses * Implicit assumption, in logic * Implicit-association test, in social psychology ...
'' if it takes Z + O(1) bits of space, * ''succinct'' if it takes Z + o(Z) bits of space, and * ''compact'' if it takes O(Z) bits of space. For example, a data structure that uses 2Z bits of storage is compact, Z + \sqrt bits is succinct, Z + \lg Z bits is also succinct, and Z + 3 bits is implicit. Implicit structures are thus usually reduced to storing information using some permutation of the input data; the most well-known example of this is the heap.


Succinct dictionaries

Succinct indexable dictionaries, also called ''rank/select'' dictionaries, form the basis of a number of succinct representation techniques, including
binary trees In computer science, a binary tree is a k-ary k = 2 tree data structure in which each node has at most two children, which are referred to as the ' and the '. A recursive definition using just set theory notions is that a (non-empty) binary tr ...
, k-ary trees and
multisets In mathematics, a multiset (or bag, or mset) is a modification of the concept of a set that, unlike a set, allows for multiple instances for each of its elements. The number of instances given for each element is called the multiplicity of that ...
, as well as
suffix tree In computer science, a suffix tree (also called PAT tree or, in an earlier form, position tree) is a compressed trie containing all the suffixes of the given text as their keys and positions in the text as their values. Suffix trees allow particu ...
s and
arrays An array is a systematic arrangement of similar objects, usually in rows and columns. Things called an array include: {{TOC right Music * In twelve-tone and serial composition, the presentation of simultaneous twelve-tone sets such that the ...
. The basic problem is to store a subset S of a universe U = \dots n) = \, usually represented as a bit array B[0 \dots n) where B[i= 1 iff i \in S. An indexable dictionary supports the usual methods on dictionaries (queries, and insertions/deletions in the dynamic case) as well as the following operations: *\mathbf_q(x) = , \, *\mathbf_q(x)= \min \ for q \in \. In other words, \mathbf_q(x) returns the number of elements equal to q up to position x while \mathbf_q(x) returns the position of the x-th occurrence of q . There is a simple representation which uses n + o(n) bits of storage space (the original bit array and an o(n) auxiliary structure) and supports rank and select in constant time. It uses an idea similar to that for Range Minimum Query, range-minimum queries; there are a constant number of recursions before stopping at a subproblem of a limited size. The bit array B is partitioned into ''large blocks'' of size l = \lg^2 n bits and ''small blocks'' of size s = \lg n / 2 bits. For each large block, the rank of its first bit is stored in a separate table R_l[0 \dots n/l); each such entry takes \lg n bits for a total of (n/l) \lg n = n / \lg n bits of storage. Within a large block, another directory R_s[0 \dots l/s) stores the rank of each of the l/s = 2 \lg n small blocks it contains. The difference here is that it only needs \lg l = \lg \lg^2 n = 2 \lg \lg n bits for each entry, since only the differences from the rank of the first bit in the containing large block need to be stored. Thus, this table takes a total of (n/s) \lg l = 4 n \lg \lg n / \lg n bits. A lookup table R_p can then be used that stores the answer to every possible rank query on a bit string of length s for i \in [0, s); this requires 2^s s \lg s = O(\sqrt \lg n \lg \lg n) bits of storage space. Thus, since each of these auxiliary tables take o(n) space, this data structure supports rank queries in O(1) time and n + o(n) bits of space. To answer a query for \mathbf_1(x) in constant time, a constant time algorithm computes: * \mathbf_1(x) = R_l[\lfloor x / l \rfloor] + R_s[\lfloor x / s\rfloor] + R_p[x \lfloor x / s\rfloor, x \text s] In practice, the lookup table R_p can be replaced by bitwise operations and smaller tables that can be used to find the number of bits set in the small blocks. This is often beneficial, since succinct data structures find their uses in large data sets, in which case cache misses become much more frequent and the chances of the lookup table being evicted from closer CPU caches becomes higher. Select queries can be easily supported by doing a binary search on the same auxiliary structure used for rank; however, this takes O(\lg n) time in the worst case. A more complicated structure using 3n/\lg \lg n + O(\sqrt \lg n \lg \lg n) = o(n) bits of additional storage can be used to support select in constant time. In practice, many of these solutions have hidden constants in the O(\cdot) notation which dominate before any asymptotic advantage becomes apparent; implementations using broadword operations and word-aligned blocks often perform better in practice.


Entropy-compressed dictionaries

The n + o(n) space approach can be improved by noting that there are \textstyle \binom distinct m-subsets of [n) (or binary strings of length n with exactly m 1’s), and thus \textstyle \mathcal(m,n) = \lceil \lg \binom \rceil is an information theoretic lower bound on the number of bits needed to store B. There is a succinct (static) dictionary which attains this bound, namely using \mathcal(m,n) + o(\mathcal(m,n)) space. This structure can be extended to support rank and select queries and takes \mathcal(m,n) + O(m + n \lg \lg n / \lg n) space. Correct rank queries in this structure are however limited to elements contained in the set, analogous to how minimal perfect hashing functions work. This bound can be reduced to a space/time tradeoff by reducing the storage space of the dictionary to \mathcal(m,n) + O(n t^t / \lg^t n + n^) with queries taking O(t) time.


Examples

A
null-terminated string In computer programming, a null-terminated string is a character string stored as an array containing the characters and terminated with a null character (a character with a value of zero, called NUL in this article). Alternative names are C stri ...
(C string) takes ''Z'' + 1 space, and is thus implicit. A string with an arbitrary length (Pascal string) takes ''Z'' + log(''Z'') space, and is thus succinct. If there is a maximum length – which is the case in practice, since 232 = 4 GiB of data is a very long string, and 264 = 16 EiB of data is larger than any string in practice – then a string with a length is also implicit, taking ''Z'' + ''k'' space, where ''k'' is the number of data to represent the maximum length (e.g., 64 bits). When a sequence of variable-length items (such as strings) needs to be encoded, there are various possibilities. A direct approach is to store a length and an item in each record – these can then be placed one after another. This allows efficient next, but not finding the ''k''th item. An alternative is to place the items in order with a delimiter (e.g.,
null-terminated string In computer programming, a null-terminated string is a character string stored as an array containing the characters and terminated with a null character (a character with a value of zero, called NUL in this article). Alternative names are C stri ...
). This uses a delimiter instead of a length, and is substantially slower, since the entire sequence must be scanned for delimiters. Both of these are space-efficient. An alternative approach is out-of-band separation: the items can simply be placed one after another, with no delimiters. Item bounds can then be stored as a sequence of length, or better, offsets into this sequence. Alternatively, a separate binary string consisting of 1s in the positions where an item begins, and 0s everywhere else is encoded along with it. Given this string, the select function can quickly determine where each item begins, given its index. This is ''compact'' but not ''succinct,'' as it takes 2''Z'' space, which is O(''Z''). Another example is the representation of a
binary tree In computer science, a binary tree is a k-ary k = 2 tree data structure in which each node has at most two children, which are referred to as the ' and the '. A recursive definition using just set theory notions is that a (non-empty) binary t ...
: an arbitrary binary tree on n nodes can be represented in 2n + o(n) bits while supporting a variety of operations on any node, which includes finding its parent, its left and right child, and returning the size of its subtree, each in constant time. The number of different binary trees on n nodes is /(n+1). For large n, this is about 4^n; thus we need at least about \log_2(4^n)=2n bits to encode it. A succinct binary tree therefore would occupy only 2 bits per node.


See also

*
Minimal perfect hash function In computer science, a perfect hash function for a set is a hash function that maps distinct elements in to a set of integers, with no collisions. In mathematical terms, it is an injective function. Perfect hash functions may be used to imp ...


References

{{DEFAULTSORT:Succinct Data Structure